Use case¶

The aim of faltwerk is to help during spatial exploratory data analysis of protein structures. There are many use cases, but here we will focus on positively selected residues in great apes in the iron transporter "transferrin", identified by Barber et al. (2014, Science, link). In this beautiful study, they trace this back to escape from iron piracy:

We show that the iron transport protein transferrin is engaged in ancient and ongoing evolutionary conflicts with TbpA, a transferrin surface receptor from bacteria.

Below we assume the state of knowledge Barber et al. had when they embarked on the study, which means we have just found that several sites in transferrin are positively selected. We will use faltwerk to see if there is any spatial component to this finding and to generate hypotheses that we can then follow up on (in the lab, for example, as the authors did). In fact, we assume even less: while the structures of transferrin and the transferrin-TbpA complex had been resolved by the time of the study, we will not use them, but instead predict them de novo using AlphaFold2 (AF2).

Disclaimer: Remember that ALL analyses below are based on the protein sequence ONLY, which continues to amaze me. Precisely because of this, one has to be careful not to place too much confidence in any individual technique, and to seek orthogonal evidence for any finding that might come up.

In [1]:
import json
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

import altair as alt
import numpy as np
import py3Dmol
import pandas as pd

from libpysal.weights import KNN, DistanceBand
from spreg import OLS, Probit
# https://github.com/pysal/spreg/blob/d464bbbc3c8601f1ca1989f4756967dca3a83179/spreg/probit.py#L704

from faltwerk.biochem import solvent_access
from faltwerk.geometry import is_close, get_complex_interface, distance_to_positions, get_alpha_carbon_atoms
from faltwerk.io import load_bfactor_column, load_conserved
from faltwerk.models import Fold, Complex, Binding, AlphaFold
from faltwerk.stats import find_hotspots, cluster
from faltwerk.vis import Layout, plot_alphafold
from faltwerk.utils import flatten

For the prediction of the transferrin protein structure we used AF2 as implemented in "ColabFold":

  • code
  • manuscript

By default, AF2 predicts 5 models and we can choose the best according to some metric, where "pLDDT" is most commonly used. Values above about 70% indicate that the model is quite confident about the prediction. A superposition of the models can give an idea of the variance in the prediction. Behind the scenes, faltwerk will take the prediction results, rename chains, and align the structures (the latter using foldseek).

In [2]:
fp = 'data/alphafold2/transferrin/'
af = AlphaFold(fp)
Best model (pLDDT): test_08df6_unrelaxed_rank_1_model_3.pdb
Align remaining models to best and rename
Aligning test_08df6_unrelaxed_rank_2_model_4.pdb to test_08df6_unrelaxed_rank_1_model_3.pdb
Renaming chain A to B
Aligning test_08df6_unrelaxed_rank_3_model_5.pdb to test_08df6_unrelaxed_rank_1_model_3.pdb
Renaming chain A to C
Aligning test_08df6_unrelaxed_rank_4_model_2.pdb to test_08df6_unrelaxed_rank_1_model_3.pdb
Renaming chain A to D
Aligning test_08df6_unrelaxed_rank_5_model_1.pdb to test_08df6_unrelaxed_rank_1_model_3.pdb
Renaming chain A to E
In [4]:
plot_alphafold(af)

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

Out[4]:
<py3Dmol.view at 0x176f61340>

We continue with the best model. To be more flexible in the visualisation of proteins, we'll introduce the Layout object.

In [5]:
model = af.best

# What annotation tracks are present?
model.annotation.keys()
# dict_keys(['position', 'plddt'])

Layout(model, panel_size=(400, 300)).geom_ribbon('plddt', palette='rainbow_r').render().show()
# Blue is good
# Pick any color or color palette from matplotlib:
# https://matplotlib.org/stable/tutorials/colors/colormaps.html

The model seems uncertain about how to fold the N-terminus, but other than that, we seem to have a decent structure prediction.

Proteins are a diverse bunch of molecules, and individual residues can have very different properties based on their (relative) position in space. faltwerk implements some convenience functions to quickly annotate such features, e.g. "solvent access".

In [6]:
_ = model.annotate_('access', solvent_access(model))
Layout(model).geom_surface('access', palette='viridis').render().show()

We now begin our scientific inquiry. Using several methods, Barber et al. arrived at several positions in the protein that seem to be under positive selection. What caught their attention was the spatial organisation of these positions: they seemed to be constrained to two regions on the linear protein sequence. Spatial clustering suggests that some part of the protein is more "relevant" to a biological process than others. The properties of this protein part can suggest what biological process could be responsible. For example, positive selection across binding sites could be caused by an evolutionary arms race between two organisms, as is the case here (for details please refer to the above-mentioned manuscript).

In [7]:
original = [153, 253, 382, 434, 435, 436, 439, 558, 574, 575, 576, 591, 592, 593, 614, 617, 619, 625]
# -1 because positions are 1-based in the publication, while Python uses 0-based indexing
barber2014 = [i - 1 for i in original]
# Turn positions into a vector with a 1 for each position under selection and a 0 otherwise.
selection = [1 if i in barber2014 else 0 for i in range(len(model))]

We can approach the question "is there spatial signal in sites under positive selection" in two ways.

  1. spatial autocorrelation (features matter)
  2. point pattern analysis (coordinates matter)

In (1) we try to find regions where a "patch" around a residue in our structure has some property, even though by pure chance we would not expect such an aggregation in space to occur (random distribution). We supply a p-value and a maximum false discovery rate (FDR) to adjust for multiple testing. Below, we define the neighborhood of a residue as a sphere with a radius of eight Angstrom. We will use the Getis-Ord statistic to find "hotspots", and in a one-sided test we are only interested in hotspots with more selection than expected (if you were also interested in negative/purifying selection, you would run the two-sided test).
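To convey the idea behind the Getis-Ord statistic, here is a toy one-dimensional sketch in plain numpy (this is an illustration only, not faltwerk's implementation; find_hotspots handles the 3D neighborhoods and the FDR correction for you):

```python
import numpy as np

def getis_ord_gi_star(x, w):
    """Standardized Getis-Ord Gi* for a feature vector x, given a
    binary neighborhood matrix w that counts each point as its own
    neighbor (the "star" variant)."""
    n = len(x)
    xbar = x.mean()
    s = np.sqrt((x ** 2).mean() - xbar ** 2)
    wx = w @ x                    # weighted sum over each neighborhood
    wsum = w.sum(axis=1)          # number of neighbors per point
    num = wx - xbar * wsum
    den = s * np.sqrt((n * (w ** 2).sum(axis=1) - wsum ** 2) / (n - 1))
    return num / den

# Toy data: ten positions, "selection" clustered at indices 4-6
x = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=float)
# Neighborhood: each position plus its immediate neighbors
w = np.eye(10)
for i in range(9):
    w[i, i + 1] = w[i + 1, i] = 1

z = getis_ord_gi_star(x, w)
print(z.argmax())  # -> 5, the center of the cluster scores highest
```

High positive z-scores mark neighborhoods with more "selection" than expected under randomness; in faltwerk the neighborhood is a sphere in 3D rather than a window on the sequence.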

For (2) we use point density to cluster points that are close in space. Here, we take all residues identified as part of a hotspot of positive selection and cluster them. Basically, this segments hotspots, which can be useful in automated analyses. Imagine that two biological processes act on your favourite protein, one on a binding site and the other on the active center. If you encounter this case (unknowingly) in an automated analysis, you want to get two clusters, on which you can perform subsequent computations individually. For clustering, faltwerk implements HDBSCAN and Markov chain clustering (MCL).
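To illustrate the segmentation step, here is a minimal stand-in using single-linkage hierarchical clustering from scipy (faltwerk's cluster() uses HDBSCAN/MCL; this sketch only conveys the idea of grouping hotspot residues by spatial proximity):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy 3D coordinates of "hotspot" residues: two well-separated groups
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(10, 3))
group_b = rng.normal(loc=20.0, scale=1.0, size=(10, 3))
coords = np.vstack([group_a, group_b])

# Merge points closer than 8 Angstrom (single linkage), then cut the tree
tree = linkage(coords, method='single')
labels = fcluster(tree, t=8.0, criterion='distance')

print(len(set(labels)))  # -> 2 spatial clusters
```

Each cluster label can then be treated as its own hotspot in downstream computations.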

In [9]:
p = 0.05
FDR = 0.05

# (1)
hotspots = find_hotspots(
    model,
    selection,
    method='getis_ord',
    angstrom=8,
    false_discovery_rate=FDR,
    test_two_sided=False)

# (2)
clusters = cluster(model, hotspots, min_cluster_size=5)

# Annotate model
model.annotate_many_({
    'selection': selection,
    'hotspots': hotspots,
    'clusters': clusters})

A note on visualisation. To build up a figure, we instantiate a Layout. We can then choose to only select() certain parts of the protein. Below, we only care about the carbon atoms of certain residues on chain A of the protein structure. We can then pass this selection to the "style" layers which start with geom_ (ggplot2 anyone?). You can layer on as many selections and styles as you want. By default they are placed in the first panel.

In [10]:
# Visualise
ly = Layout(model, panel_size=(200, 200), grid=(1, 3), linked=True)

pos = ly.select(residues=barber2014, elements=['CA'], chain='A')

ly.geom_ribbon(color='#ffffff')
ly.geom_sphere(selection=pos, color='black')

ly.geom_surface('hotspots', palette='binary', panel=(0, 1))
ly.geom_surface('clusters', palette='Set2_r', panel=(0, 2))

ly.render().show()

We find two hotspots on the C-terminal "lobe" of the protein. While there are two positively selected positions on the N-terminal lobe, they don't seem to cluster. Are hotspots useful? Maybe. Imagine you only have a few species representatives to analyse. How likely are you to find all sites that natural selection acts on? Hotspots "color in" regions that are spatially plausible, and it might or might not make sense to base further calculations on them rather than the original residues. However, faltwerk gives you the tools to quickly try both.

Proteins often bind stuff. Transferrin for example is an iron (Fe3+) shuttle. Let's visualise which residues bind iron; maybe there is some spatial relation. Note that faltwerk here uses the method implemented by Kiefl et al. in anvi'o, based on work by Kobren and Singh (2018) called "InteracDome":

  • https://merenlab.org/2020/07/22/interacdome/
  • https://www.biorxiv.org/content/10.1101/2022.03.02.482602v1
  • https://academic.oup.com/nar/article/47/2/582/5232439
In [12]:
# TODO: adjust path to Pfam domains, see install instructions
pfam = 'path/to/pfam_v31/Pfam-A.hmm'

b = Binding(model, 'representable')
b.predict_binding_(pfam)
# b.domains
# b.interactions
binding = b.get_binding('PF00405.16', 'FE')
fe = [i for i, j in enumerate(binding) if j > .5]

ly = Layout(model)
# select
pos = ly.select(residues=barber2014, elements=['CA'], chain='A')
fe_ = ly.select(residues=fe)  # carbon atoms of chain A are selected by default
# style the layer cake
ly.geom_ribbon(color='#ffffff')
ly.geom_sphere(selection=pos, color='black')
ly.geom_ribbon(selection=fe_, color='red')

ly.render().show()

As expected, we find one Fe-binding site per lobe. Note how residues distant on the linear sequence fold into spatial proximity to "hold" the iron ion. Also, the sites under positive selection have no clear relationship to this center. But never mind; let's just add the distance of each residue to the closest binding-site residue as a regressor for later. faltwerk offers several functions that make this kind of "geometry" calculation as easy as:

In [13]:
model.annotate_('distance_to_fe', distance_to_positions(model, fe))
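Conceptually, this computes, for every residue, the Euclidean distance to the closest member of a set of positions. A minimal sketch of the same idea with scipy's KD-tree on toy alpha-carbon coordinates (an illustration, not the faltwerk API):

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy alpha-carbon coordinates for a 6-residue "protein" along a line
ca = np.array([[0, 0, 0], [3, 0, 0], [6, 0, 0],
               [9, 0, 0], [12, 0, 0], [15, 0, 0]], dtype=float)
binding_site = [2, 3]  # residue indices, e.g. predicted Fe-binding residues

# Distance from every residue to the nearest binding-site residue
tree = cKDTree(ca[binding_site])
dist, _ = tree.query(ca)
print(dist)  # -> [6. 3. 0. 0. 3. 6.]
```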

We want to add more features that could explain the selection pattern, so let's keep collecting. Transferrin is known to bind several proteins. Barber et al. used such protein complex structures from crystallography experiments, but AF2 predicts complexes surprisingly well. We will add "distance to binding interface" as a feature for our statistical work later. Scientifically speaking, TbpA is the iron pirate referred to in the manuscript.

In [14]:
fp = 'data/3V8X/complex/test_cacad_unrelaxed_rank_1_model_3.pdb'
cx = Complex(fp)

interface = get_complex_interface(cx, 10)
distance_to_interface = distance_to_positions(model, interface)
model.annotate_('distance_to_interface', distance_to_interface)

ly = Layout(cx)
B = ly.select(chain='B')
ly.geom_ribbon(selection=B).render().show()
<Chain id=B>
<Chain id=C>

We can also add annotations that were calculated by other programs and prediction models, for example:

  • per-residue values from external programs such as predicted binding sites
  • a multiple sequence alignment, where we are interested in how conserved residues are
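A common trick is to smuggle per-residue scores into the B-factor column of a PDB file, which load_bfactor_column then reads back out. A minimal sketch of the underlying parsing, using the fixed-width columns of the PDB format (columns 61-66 of ATOM records hold the B-factor; toy lines, not faltwerk's implementation):

```python
# Two toy ATOM records; the final numeric field before the element is the B-factor
pdb_lines = [
    "ATOM      1  CA  MET A   1      11.104  13.207   2.100  1.00 81.53           C",
    "ATOM      2  CA  ALA A   2      12.560  14.801   3.055  1.00 82.79           C",
]

# B-factor lives in columns 61-66 (1-based), i.e. slice [60:66] in Python
bfactors = [float(line[60:66]) for line in pdb_lines if line.startswith('ATOM')]
print(bfactors)  # -> [81.53, 82.79]
```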
In [15]:
fp = 'data/transferrin_binding.pdb'
dmasif = [i for i in load_bfactor_column(fp)]

fp = 'data/conservation/isoforms.linsi.faa'  # protein MSA in fasta format
conserved = load_conserved(fp)

model.annotate_many_({
    'interface_prediction': dmasif,
    'conserved': conserved
})
In [16]:
ly = Layout(model, grid=(1, 4), panel_size=(200, 200), linked=False)
# unlink, to turn individually

pos = ly.select(residues=barber2014, elements=['CA'], chain='A')
fe_ = ly.select(residues=fe)  # carbon atoms of chain A are selected by default

ly.geom_ribbon(color='#ffffff')
ly.geom_sphere(selection=pos, color='black')

# Pick color(map)s from matplotlib
# https://matplotlib.org/stable/tutorials/colors/colormaps.html
ly.geom_surface('hotspots', palette='binary', panel=(0, 1))
ly.geom_surface('distance_to_interface', panel=(0, 2))
ly.geom_surface('interface_prediction', panel=(0, 3))

ly.render().show()

We can see that residues close to the TbpA binding interface co-locate with our selection hotspots (Panel 3/4), while there is poor correspondence between the (more general) interface prediction based on multiple residue properties (Panel 4/4).

Now we can start to check out some ideas obtained from eyeballing the data. The model annotation can easily be turned into a pandas dataframe and explored using a wide range of visualisation tools, here altair.

In [17]:
df = pd.DataFrame.from_dict(flatten(model.annotation, expected_track_length=len(model)))

alt.Chart(df).mark_boxplot(extent=1.5).encode(
    x='selection:O',
    y='distance_to_interface:Q',
)
Out[17]:
In [18]:
df
Out[18]:
position plddt access selection hotspots clusters distance_to_fe distance_to_interface interface_prediction conserved
0 1 42.63 1.000000 0 0 0 66.680168 68.355728 81.53 1.0000
1 2 44.54 0.995968 0 0 0 64.177109 64.654755 82.79 1.0000
2 3 45.96 1.000000 0 0 0 62.222004 63.754223 87.52 1.0000
3 4 49.76 0.688679 0 0 0 60.561222 64.366493 88.10 1.0000
4 5 50.62 0.788732 0 0 0 57.877182 61.342251 89.19 0.9000
... ... ... ... ... ... ... ... ... ... ...
693 694 89.39 0.183099 0 0 0 16.412100 27.741705 84.40 0.8053
694 695 90.23 0.142132 0 0 0 13.978737 26.945194 88.93 1.0000
695 696 77.40 0.229839 0 0 0 10.931679 24.255344 72.21 0.5579
696 697 64.85 0.379032 0 0 0 12.241253 26.450377 76.98 0.7316
697 698 45.53 0.889706 0 0 0 11.424846 28.426435 60.73 0.3000

698 rows × 10 columns

Once hypotheses have been generated, we need to test them. The typical regression framework assumes independent data points, which we clearly violate due to the spatial dependency (neighboring residues are more likely to share properties than would be expected by chance). Luckily, there is a spatial regression framework (pysal), which we can easily interface with.

Where utter patternlessness or randomness prevails, nothing is predictable. -- "Real Patterns", D. Dennett, 1991

First we will test several variables against a binary dependent one, namely whether a residue is under positive selection or not. We could use the positions that have been identified by Barber et al., but here we will use the spatial hotspots identified above. Alternatively, we could just use single clusters in more granular analyses or if we suspect the hotspots to emerge due to multiple selective forces (cluster 1 from protein A, 2 from protein B etc.).
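The weights matrix used below encodes the neighborhood structure; row-standardization (w.transform = 'r') rescales each row to sum to one, so the spatial lag of a variable becomes an average over its neighbors. A toy numpy version of a binary distance-band matrix (illustration only; libpysal's DistanceBand does this, plus sparse storage, for you):

```python
import numpy as np

# Toy 1D coordinates; neighbors = points within 8 units (excluding self)
points = np.array([[0.0], [5.0], [10.0], [30.0]])
d = np.abs(points - points.T)            # pairwise distances
w = ((d <= 8) & (d > 0)).astype(float)   # binary distance band

# Row-standardize: each row sums to 1 (isolated points keep a zero row)
row_sums = w.sum(axis=1, keepdims=True)
w_r = np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

print(w_r)
```

Here the middle point has two neighbors, each weighted 0.5, while the distant point at 30 has none; in the protein the same logic applies to alpha-carbon coordinates in 3D.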

In [19]:
points = list(get_alpha_carbon_atoms(model, only_coords=True))

# Define what is a "neighborhood", either using k-nearest neighbors ...
w = KNN.from_array(points, k=8)
# ... or euclidean distance
angstrom = 8
w = DistanceBand(points, angstrom, p=2, binary=True)
w.transform='r'

y = np.array(df['hotspots'])
columns = ['distance_to_interface', 'distance_to_fe', 'access']
x = np.array(df[columns])

m = Probit(y, x, w=w, name_y='selection', name_x=columns)
np.around(m.betas, decimals=6)
# constant, then independent variables in order of appearance
Out[19]:
array([[-0.790137],
       [-0.059364],
       [-0.015249],
       [ 1.159931]])
In [20]:
print(m.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: CLASSIC PROBIT ESTIMATOR
-------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   selection                Number of Observations:         698
% correctly predicted: 94.99
Log-Likelihood       : -120.0490
LR test              : 37.6178
LR test (p-value)    : 0.0000

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      -0.7901370       0.3086193      -2.5602320       0.0104602
distance_to_interface      -0.0593643       0.0125276      -4.7386875       0.0000022
      distance_to_fe      -0.0152488       0.0184589      -0.8260964       0.4087494
              access       1.1599307       0.3803480       3.0496563       0.0022910
------------------------------------------------------------------------------------

MARGINAL EFFECTS
Method: Mean of individual marginal effects
------------------------------------------------------------------------------------
            Variable           Slope       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
              access       0.1044783       0.0352032       2.9678680       0.0029987
      distance_to_fe      -0.0013735       0.0016062      -0.8551417       0.3924727
distance_to_interface      -0.0053471       0.0009770      -5.4732144       0.0000000



DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                           MI/DF       VALUE           PROB
Kelejian-Prucha (error)           1          22.273          0.0000
Pinkse (error)                    1         877.230          0.0000
Pinkse-Slade (error)              1         377.509          0.0000

================================ END OF REPORT =====================================

There is a pretty significant association between our positively selected sites and both solvent access and distance to the TbpA binding interface. As a sort of negative control, we observe no association with distance to the Fe-binding site.

So let's try a regression where the dependent variable is continuous:

In [17]:
y = np.array(df['conserved'])
x = np.array(df[columns])
m = OLS(y, x, w=w, name_y='selection', name_x=columns)
print(m.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   selection                Number of Observations:         698
Mean dependent var  :      0.9117                Number of Variables   :           4
S.D. dependent var  :      0.1863                Degrees of Freedom    :         694
R-squared           :      0.1435
Adjusted R-squared  :      0.1398
Sum squared residual:      20.725                F-statistic           :     38.7509
Sigma-square        :       0.030                Prob(F-statistic)     :   3.678e-23
S.E. of regression  :       0.173                Log likelihood        :     236.970
Sigma-square ML     :       0.030                Akaike info criterion :    -465.940
S.E of regression ML:      0.1723                Schwarz criterion     :    -447.747

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       0.9027318       0.0157495      57.3181461       0.0000000
distance_to_interface       0.0030399       0.0006872       4.4237379       0.0000113
      distance_to_fe       0.0014898       0.0010739       1.3873124       0.1657919
              access      -0.2804807       0.0289132      -9.7007881       0.0000000
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER            6.453

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2         487.265           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                3         143.576           0.0000
Koenker-Bassett test              3          65.069           0.0000
================================ END OF REPORT =====================================

Very conserved residues are deeper in the protein (less access).